Semantic Classification of Chinese Unknown Words
Abstract
This paper describes a classifier that assigns semantic thesaurus categories to unknown Chinese words (words that occur in the Sinica Corpus but are not already in the CiLin thesaurus and the Chinese Electronic Dictionary). The focus of the paper differs from previous research in this area in two ways. First, prior research on Chinese unknown words mostly focused on proper nouns (Lee 1993; Lee, Lee and Chen 1994; Huang, Hong and Chen 1994; Chen and Chen 2000). This paper does not address proper nouns and focuses instead on common nouns, adjectives, and verbs. My analysis of the Sinica Corpus shows that, contrary to expectation, most unknown words in Chinese are common nouns, adjectives, and verbs rather than proper nouns. Second, other previous research has focused on features related to the contexts in which unknown words occur (Caraballo 1999; Roark and Charniak 1998). While context is clearly an important feature, this paper focuses on non-contextual features, which may play a key role for unknown words that occur only once and hence have limited context. The feature I focus on, following Ciaramita (2002), is morphological similarity to words whose semantic category is known. My nearest neighbor approach to lexical acquisition computes the distance between an unknown word and examples from the CiLin thesaurus based on the unknown word's morphological structure. The classifier improves on baseline semantic categorization performance for adjectives and verbs, but not for nouns.
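The nearest neighbor idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the distance measure, so the character-overlap score below is a stand-in, and the lexicon `KNOWN_WORDS`, the category labels, and the functions `similarity` and `classify` are hypothetical rather than taken from the paper or from CiLin.

```python
from collections import Counter

# Toy lexicon mapping known words to categories.
# The labels are invented for this sketch; they are NOT CiLin codes.
KNOWN_WORDS = {
    "電腦": "DEVICE",
    "電話": "DEVICE",
    "電視機": "DEVICE",
    "洗衣機": "DEVICE",
    "游泳": "MOTION",
    "跑步": "MOTION",
    "散步": "MOTION",
    "美麗": "QUALITY",
    "華麗": "QUALITY",
}


def similarity(word, other):
    """Assumed morphological similarity: Dice overlap of the two
    character sets, plus 1.0 for a shared first character and 1.0
    for a shared last character."""
    chars_a, chars_b = set(word), set(other)
    dice = 2.0 * len(chars_a & chars_b) / (len(chars_a) + len(chars_b))
    same_first = 1.0 if word[0] == other[0] else 0.0
    same_last = 1.0 if word[-1] == other[-1] else 0.0
    return dice + same_first + same_last


def classify(unknown, lexicon=KNOWN_WORDS, k=3):
    """Return the majority category among the k most similar known
    words, or None if no known word shares a character."""
    if not unknown:
        return None
    scored = [(similarity(unknown, w), w, c) for w, c in lexicon.items()]
    scored = [s for s in scored if s[0] > 0]
    if not scored:
        return None
    # Highest score first; ties broken by word for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    votes = Counter(category for _, _, category in scored[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    for w in ["烘衣機", "慢跑", "秀麗"]:
        print(w, classify(w))
```

With this toy lexicon, `classify("烘衣機")` returns `"DEVICE"` because its closest neighbors are 洗衣機 and 電視機, which share the final character 機. A word that shares no character with any known word gets `None`, which mirrors the case where non-contextual evidence alone is not enough.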
Similar resources
Hybrid Models for Semantic Classification of Chinese Unknown Words
This paper addresses the problem of classifying Chinese unknown words into fine-grained semantic categories defined in a Chinese thesaurus. We describe three novel knowledge-based models that capture the relationship between the semantic categories of an unknown word and those of its component characters in three different ways. We then combine two of the knowledge-based models with a corpus-ba...
Computing the Sentiment Polarity of Chinese Words and Sentences
This paper reports on experiments with a newly available sentiment classification test collection for Chinese. Detection of negation during training and classification is shown to improve the accuracy of character-based classification for the semantic orientation of individual Chinese words. Using the resulting classifier for unknown words resulted in substantial improvements in classification ...
The Identification and Classification of Unknown Words in Chinese: An N-Grams-Based Approach
In this paper, we propose a new approach to identifying unknown words in Chinese. This approach uses an n-grams program to sort out the collocating word/character sequences that are possible words and phrases in Chinese. In addition to proposing criteria for identifying Chinese new words, we also classify these new words according to their structural and semantic characteristics. The cor...
An Empirical Study of Unsupervised Semantic Classification of Chinese Reviews
This paper is an empirical study of unsupervised semantic classification of Chinese reviews. The focus is on exploring the ways to improve the performance of the unsupervised semantic classification based on limited existing semantic resources in Chinese. On the one hand, all available Chinese semantic lexicons — individual and combined — are evaluated under our proposed framework. On the other...
Hybrid Models for Chinese Unknown Word Resolution Dissertation
Word segmentation, part-of-speech (POS) tagging, and sense tagging are important steps in various Chinese natural language processing (CNLP) systems. Unknown words, i.e., words that are not in the dictionary or training data used in a CNLP system, constitute a major challenge for each of these steps. This dissertation is concerned with developing hybrid models that effectively combine statistic...